MMTBENCH: A Unified Benchmark for Complex Multimodal Table Reasoning

Advancing Evaluation of Vision–Language Models with Real-World Multimodal Tables

Authors: P. Y. Titiya et al.
Published on arXiv: 2025-05-27
Link: http://arxiv.org/abs/2505.21771v1
Institutions: Arizona State University
Keywords: multimodal tables, question answering, vision-language models, large language models, benchmark dataset, table reasoning, visual reasoning, machine learning evaluation, real-world data, image-text integration

Multimodal tables, which combine structured data with visual elements such as charts and maps, are increasingly prevalent in real-world domains, yet they remain challenging for current Vision-Language Models (VLMs) and Large Language Models (LLMs). Existing research has concentrated largely on text-only tables, so complex multimodal table reasoning is underexplored and lacks robust benchmarks that reflect real-world complexity.
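To make the task concrete, here is a minimal sketch of what a multimodal table QA item and a simple evaluation loop might look like. This is not code from the paper: the dataset fields, the exact-match metric, and the `answer_question` model call are illustrative assumptions, not MMTBENCH's actual format or API.

```python
# Hypothetical sketch of multimodal table QA evaluation (not the authors' code).
from dataclasses import dataclass

@dataclass
class MultimodalTableItem:
    table_image_path: str   # rendered table that embeds charts, maps, or icons
    question: str           # natural-language question over the table
    reference_answer: str   # gold answer string

def exact_match(prediction: str, reference: str) -> bool:
    """Naive normalized exact match; real benchmarks often use more
    forgiving metrics (numeric tolerance, answer-set matching)."""
    return prediction.strip().lower() == reference.strip().lower()

def evaluate(model, items: list[MultimodalTableItem]) -> float:
    """Return the fraction of items answered correctly.

    `model.answer_question(image_path, question)` is a placeholder for
    whatever inference interface a given VLM exposes.
    """
    correct = 0
    for item in items:
        prediction = model.answer_question(item.table_image_path, item.question)
        correct += exact_match(prediction, item.reference_answer)
    return correct / len(items) if items else 0.0
```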

To address these gaps, the authors introduce MMTBENCH, a comprehensive evaluation benchmark built around the following main contributions:

Building on this benchmark, the authors report empirical results that highlight the capabilities and limitations of current models:

Reflecting on these findings, the paper draws its conclusions and outlines future directions for the field: